Mask Attention Visualization Demo

This demo visualizes the attention maps of masked words and masked image regions.

Settings

The next two code cells configure the model and define the visualization function.

We use rosita-base.pkl as the default checkpoint.

The show_maskatt_img function is the core of the visualization. As stated in our paper, we first extract the multi-head attention maps from the last MSA block of the pretrained model, then sum them element-wise over the heads to obtain a single attention map, and finally visualize the row(s) of that map corresponding to the masked token(s).
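The head-summing and row-selection steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's implementation: the function name masked_attention_rows, the (num_heads, seq_len, seq_len) array layout, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def masked_attention_rows(attn, mask_positions):
    """Sum attention over heads and return one row per masked position.

    attn: array of shape (num_heads, seq_len, seq_len) (assumed layout).
    mask_positions: sequence positions of the masked tokens.
    """
    attn = np.asarray(attn)
    if attn.ndim != 3 or attn.shape[1] != attn.shape[2]:
        raise ValueError("expected shape (num_heads, seq_len, seq_len)")
    summed = attn.sum(axis=0)              # element-wise addition over heads
    return summed[list(mask_positions)]    # shape: (num_masked, seq_len)

# Toy input: 12 heads, 8 tokens, random attention with softmax-normalized rows.
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 8, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

rows = masked_attention_rows(attn, [2, 5])
print(rows.shape)  # (2, 8)
```

Because the heads are summed rather than averaged, each returned row sums to the number of heads instead of 1. For plotting this only changes the scale, not the relative weights.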

We provide several paired image-text examples from the MSCOCO validation set for the visualizations below. These examples were not used during pretraining. For each image, we provide pre-extracted bottom-up-attention region features (36 detected objects). Each image is associated with five captions.

The images are in the demo/images folder. Here are the file names of these images:

The extracted region features and captions for each image are stored in the demo/features folder, in a *.jpg.npz file with the same name as the image.
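To see what one of these archives contains, something like the sketch below may help. The helper names feature_path and list_arrays are hypothetical, and the array names inside the real .npz files are not documented here, so the example only lists whatever keys it finds. The temporary file, its key "x", and its shape are placeholders so that the snippet runs on its own.

```python
import os
import tempfile
import numpy as np

def feature_path(img_name, feat_dir="demo/features"):
    """Return the .npz path that shares its name with the image file."""
    return os.path.join(feat_dir, img_name + ".npz")

def list_arrays(npz_path):
    """Return the names of the arrays stored in an .npz archive."""
    with np.load(npz_path) as data:
        return sorted(data.files)

# Stand-in archive so the example is runnable without the demo data.
# The key "x" and the (36, 2048) shape are placeholders, not the real layout.
with tempfile.TemporaryDirectory() as tmp:
    path = feature_path("example.jpg", feat_dir=tmp)
    np.savez(path, x=np.zeros((36, 2048), dtype=np.float32))
    keys = list_arrays(path)

print(keys)  # ['x']
```

With the demo data in place, calling list_arrays(feature_path(img_name)) on one of the provided image names should show which arrays hold the region features and the captions.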

To use show_maskatt_img, specify img_name (one of these eight image names), mask_side ('text' or 'img'), mask_id (the ids of the masked words or regions), and either text_id or text.

To mask a single word (region), pass one index in the mask_id field. To mask multiple words (regions), pass a list of indices in the mask_id field.

The text_id should be within [0, 4], corresponding to one of the five captions of the given image. To provide your own caption for the image, use the text field to enter the custom caption.
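The argument rules above can be summarized as a small validation sketch. The helper normalize_mask_args is hypothetical and is not part of the demo. How show_maskatt_img behaves when both text_id and text are given is not stated here, so this sketch assumes exactly one of them is supplied.

```python
def normalize_mask_args(mask_side, mask_id, text_id=None, text=None, num_captions=5):
    """Validate the arguments and return (mask_side, mask_ids, caption_source).

    caption_source is ("text_id", i) for a provided caption or
    ("text", s) for a custom caption.
    """
    if mask_side not in ("text", "img"):
        raise ValueError("mask_side must be 'text' or 'img'")

    # A single index and a list of indices are both accepted.
    mask_ids = [mask_id] if isinstance(mask_id, int) else list(mask_id)
    if not mask_ids:
        raise ValueError("mask_id must contain at least one index")

    # Assumption of this sketch: exactly one of text_id / text is given.
    if (text_id is None) == (text is None):
        raise ValueError("provide exactly one of text_id or text")
    if text is not None:
        return mask_side, mask_ids, ("text", text)
    if not 0 <= text_id < num_captions:
        raise ValueError("text_id must be within [0, 4]")
    return mask_side, mask_ids, ("text_id", text_id)

print(normalize_mask_args("text", 3, text_id=0))
# ('text', [3], ('text_id', 0))
print(normalize_mask_args("img", [1, 4], text="a dog on a sofa"))
# ('img', [1, 4], ('text', 'a dog on a sofa'))
```

A call to the real function would then presumably pass the same values by keyword, for example mask_side='text', mask_id=[1, 4], text_id=0, together with an img_name. The exact signature is defined in the settings cells.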

Now, let's start!

Mask One Word

The following examples visualize the attention map when one word is masked.

Mask Multiple Words

The following examples visualize the attention maps when multiple words are masked.

When the text has multiple masked words, each masked word has its own attention map.

Mask One Word with Custom Text

The following examples visualize the attention map when one word is masked in a custom text.

Mask Multiple Words with Custom Text

The following example visualizes the attention maps when multiple words are masked in a custom text.

Mask One Region

The following examples visualize the attention map when one region is masked.

Mask Multiple Regions

The following examples visualize the attention maps when multiple regions are masked.

When the image features contain multiple masked regions, each masked region has its own attention map over the text.

Mask One Region with Custom Text

The following example visualizes the attention map when one region is masked and a custom text is used.

Mask Multiple Regions with Custom Text

The following example visualizes the attention maps when multiple regions are masked and a custom text is used.